fix(operator): targeted reindex after pg_resetwal via amcheck#68
Merged
Conversation
dfcf7d2 to
34876f1
Compare
pg_resetwal bypasses WAL replay — it tells postgres to act as if the data dir is consistent without replaying anything. Any index update in flight at snapshot time can leave torn pages, which postgres later surfaces as 'unexpected zero page at block N' when a query happens to hit the corrupted index. Downstream users reported exactly this against a primary key, with the hint 'Please REINDEX it.' A blind REINDEX DATABASE takes hours on the prod-sized DBs (known from prior experience). Instead, use postgres's amcheck contrib extension to do a structural scan: read each valid btree index, verify its invariants, queue only the failing ones for REINDEX. Plus a blind REINDEX of every non-btree index (amcheck only covers btree, non-btree are usually a tiny fraction of total index size). Reuses the existing post-restore reindex infrastructure (the /pgdata/needs-reindex marker, the background hook in the postgres container startup, the readiness probe gate). Adds a second marker /pgdata/needs-reindex-all set by every pg_resetwal -f invocation in the operator's two-stage fallback path. The runtime hook handles both flags, with the -all branch supplanting the narrow locale-only branch when both are set.
34876f1 to
fa961fd
Compare
passcod
added a commit
that referenced
this pull request
Jun 10, 2026
…SE CONCURRENTLY The amcheck-driven smart pass introduced in #68 hits the same family of postgres-internal pathology that wedges other vanilla DDL on the prod dataset — bt_index_check itself burns 100% CPU forever on specific indexes (observed via pg_stat_activity: state=active, empty wait_event, query_start growing linearly with wall clock for 4+ minutes on a tiny system catalog index). The smart pass is unusable on this data. Two changes: 1. Drop bt_index_check; revert the needs-reindex-all branch to a blind REINDEX DATABASE per user DB, CONCURRENTLY on PG ≥ 12 so clients can keep using the old indexes during the rebuild and the atomic swap makes corruption disappear without downtime. REINDEX reads the heap and rebuilds the index from scratch — a different code path from amcheck (which reads corrupt index pages directly) — so it isn't subject to the same wedge. Slow on prod-sized DBs (hours) but makes progress. 2. Drop the readiness probe's gate on /pgdata/needs-reindex-all. With the reindex taking hours, gating readiness here trips the operator's 30-minute deployment_ready_timeout and the restore is marked Failed before postgres even has a chance to come up. The probe still gates on /pgdata/needs-reindex (locale-only, finishes in seconds) since that's small and proven. Trade-off: clients hitting a not-yet-reindexed corrupt index in the window between pod-Ready and reindex-complete get the explicit "unexpected zero page" error from postgres. With CONCURRENTLY the window is narrow (queries hit the old index until the atomic swap) and clients can retry. Strictly better than the alternative (permanently failed restore, replica stuck indefinitely on the previous Active). The pre-#68 behaviour for the locale-only path is unchanged.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
🤖
Summary
When the restore init script falls back to
pg_resetwal -f(the operator's two-stage WAL-recovery-failure path), it touches a new/pgdata/needs-reindex-allmarker. The main container's background-reindex hook picks up that marker and runs a targeted reindex via theamcheckcontrib extension: scan every valid btree index, REINDEX only the ones that fail the structural check, plus a blind REINDEX of every non-btree index. All before the readiness probe lets traffic in.Why not REINDEX DATABASE
A blind REINDEX DATABASE takes hours on the prod-sized DBs (known from prior experience — the prod indexes are many). Most pg_resetwal cases corrupt zero or a handful of indexes; rewriting everything is wasteful.
amcheckreads each btree index page and verifies its invariants — including the "unexpected zero page at block N" case that prompted this. Cost: ~index-size of disk reads. For a clean snapshot it finds nothing and reads only the indexes; for a corrupt one it identifies exactly which need rebuilding.Caveats:
amcheckonly covers btree. GIN / GiST / BRIN / hash indexes get a blind REINDEX (typically a small fraction of total index size; safer than skipping).bt_index_check(light variant) takes onlyAccessShareLockand doesn't block writes.bt_index_checkisn't exhaustive — there are corruption shapes it misses. But it does catch zero-page issues, which is the reported failure mode.Shape
postgres_single_or_resetwal(init script) nowtouches/pgdata/needs-reindex-allat both pg_resetwal invocation sites (detected-WAL-signature branch + last-resort branch).needs-reindex-allpresent → enable amcheck, scan btrees, REINDEX corrupt btrees + all non-btrees. Clears both markers (the locale-only set is a strict subset).needs-reindexpresent → existing collation-only loop.Tests
Three new unit tests:
pg_resetwal -f "$PGDATA"invocation in the init script is paired withtouch /pgdata/needs-reindex-all.bt_index_check, collects non-btree indexes, runs targeted REINDEX, and explicitly does not useREINDEX DATABASE.